Red Hat Enterprise Linux 7 Troubleshooting

Being Proactive, Part 1

Module Topics

Being Proactive
Monitoring: Centralized Logging
Monitoring: Hard Drive Failures
Baselining: Using AIDE
Baselining: Using sar
Network Monitoring

Being Proactive

Steps that you should take before a problem occurs:

Monitoring systems
Baselining systems
Managing multiple versions of configuration files
Writing a disaster recovery plan

Being Proactive

Support Contracts

Support contracts for critical systems are essential.
Most large software and hardware vendors offer a range of support options.
Before you purchase a support contract, check what coverage you will actually get.
If support contracts are not enough, you may also need to keep spare hardware on-site.
- If possible, configure spare hardware with automatic failover to minimize downtime.
- This is called a "hot swap."
- You may also want to keep a spare on-site for cold swap components.
Track warranties on hardware, as it may help you obtain spares quickly and efficiently.
- Replace disks that are out of warranty.
- Avoid using disks that are out of warranty for critical data.
- Have a replacement plan in place, including funding and migration plans.
- Know what to do when a warranty expires.

== m02p03_support_contracts

Support contracts for critical systems are essential. Most large software and hardware vendors offer a range of support, from web-based ticketing systems to 24x7x365 support plans for mission-critical applications and hardware. Before you take out a support contract, check what coverage you will actually get. Often a guaranteed 4-hour response time does not mean an engineer arriving on site, but merely a response by phone or email to the initial ticket.

In some cases support contracts are not enough, and you may also need to keep spare hardware on-site. Disk systems (such as NAS and SAN arrays) often allow you to configure the spare hardware with automatic failover, which enables you to swap out the faulty drive or component without any downtime. This is known as a "hot swap." You may also want to keep a spare on-site for cold swap components. If a hardware failure occurs, you can use the spare until the original hardware has been repaired or replaced. Because of the low cost of x86-based hardware and Linux-based operating systems, this has become a viable alternative to maintaining expensive proprietary hardware with expensive support contracts.

Tracking warranties on hardware is also extremely useful as it may help with obtaining spares quickly and efficiently. This is especially important with the warranties on hard disks. Disks out of warranty should be replaced or at least not used for critical data. Have a replacement plan in place, including funding and migration plans, so that you know what to do when a warranty expires.

Being Proactive

Documentation

Maintain documents that fully outline and identify the following for your organization:

Hardware
Software
Configuration settings for each component

Being Proactive

Documentation

[student@server1 ~]$ man -k passwd
checkPasswdAccess (3) - query the SELinux policy database in the kernel.
chpasswd (8)          - update passwords in batch mode
ckpasswd (8)          - nnrpd password authenticator
fgetpwent_r (3)       - get passwd file entry reentrantly
getpwent_r (3)        - get passwd file entry reentrantly
...
passwd (1)            - update user's authentication tokens
sslpasswd (1ssl)      - compute password hashes
passwd (5)            - password file
passwd.nntp (5)       - Passwords for connecting to remote NNTP servers
passwd2des (3)        - RFS password encryption
...

Being Proactive

Documentation

Most other documentation is found in the /usr/share/doc/ directory, in subdirectories named by the RPM package.
If it is not a man page, not an info page, and not part of the GNOME help utility, it is stored here.
Many applications have their documentation packaged in a separate RPM package, which may or may not be installed. In Red Hat Enterprise Linux 7, these packages are often found in the Optional tree.
To locate the documentation supplied with an RPM package:
- Use rpm -qd package to list all files flagged %doc.
- Use rpm -qc package to list all configuration files distributed in the package.

References

man(1) and rpm(8) man pages
/usr/share/doc/packagename/

== m02p06_documentation_3

By convention, most other documentation is found in the /usr/share/doc/ directory, in subdirectories named by the RPM package. The /usr/share/doc/ directory is used to collect "everything else." If it is not a man page, not an info page, and not part of the GNOME help utility, it is stored here.

The documentation directory for the zip utility, for example, tells you the compression algorithm, and little else. This is not much help to the administrator. The samba-* directory, however, includes many useful documents.

Many applications have their documentation packaged in a separate RPM package, which may or may not be installed. An example is the bash-doc package. In Red Hat Enterprise Linux 7, these packages are often found in the Optional tree.

A last (or first) resort to locate the documentation supplied with an RPM package is to use the rpm -qd package command, which lists all files flagged %doc.

Another useful command is rpm -qc package, which lists all configuration files distributed in the package.

Monitoring: Centralized Logging

Information gathering is one of the most important phases of troubleshooting.
Log files, kernel output, and device output can all help you diagnose your system more quickly.
Knowing how to order and search output is essential in troubleshooting.
Commands such as grep, uniq, sort, and less are fundamental to finding errors and identifying problems.
If possible, compare logs and output with a similar healthy system to locate relevant error messages.
Once you locate the errors, you can fix the problem, and then test.

== m02p07_monitor_cent_logging_1

In this section you will learn to configure the system to receive log messages from other systems and to configure the system to forward log messages to a central log server.

Information gathering is one of the most important phases of troubleshooting. If you do not know what is wrong with a system, it can be hard to fix it. Luckily, your system provides a large amount of information if you look for it. Log files, kernel output, and device output can all help you diagnose your system more quickly. However, all this information can be overwhelming, and real problems are often hidden in the reams of normal output your system generates. Knowing how to order and search output is essential in troubleshooting. Being able to use commands such as grep, uniq, sort, and less is fundamental to finding errors and identifying problems. If possible, compare logs and output with a similar healthy system; this can help you locate the relevant error messages much more quickly. Once you locate the errors, you can fix the problem, and then test.
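
The ordering and searching described above can be sketched as a small pipeline. This is only a minimal sketch, not part of the course material: the function name summarize_log is hypothetical, and it assumes the traditional syslog line layout (month, day, time, host name, then the message) used in /var/log/messages.

```shell
#!/bin/bash
# summarize_log (hypothetical helper): read syslog-style lines on stdin,
# drop the timestamp (fields 1-3) and the host name (field 4), then count
# identical messages and print the most frequent ones first.
summarize_log() {
    awk '{ $1 = $2 = $3 = $4 = ""; sub(/^ +/, ""); print }' |
        sort | uniq -c | sort -rn
}

# Example (assumes a readable /var/log/messages):
#   grep -i error /var/log/messages | summarize_log | less
```

Stripping the timestamp first is what lets uniq collapse repeated messages; without it, every line would be unique.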

Monitoring: Centralized Logging

Good logging practices are prerequisites to effective troubleshooting.
Ensure that syslog is running and configured to log information from important services on all systems.
Increase the loglevel to aid with troubleshooting. (For example, from info to debug.)
Ensure that important messages are forwarded to a central log server, perhaps one that is proactively watching the events to notify you of pending failures.
Red Hat Enterprise Linux 7 uses rsyslog for event logging. rsyslog is an enhanced syslog daemon that supports both UDP and TCP transport, failover destinations, and queued operations.
- /etc/rsyslog.conf contains numerous comments.
- See /usr/share/doc/rsyslog-*/ for more information.

== m02p08_monitor_cent_logging_2

Good logging practices are prerequisites to effective troubleshooting. Arriving at a broken system only to find little or nothing in the way of logs is frustrating and will slow down the process. Ensure that syslog is running and configured to log information from important services on all systems. Also, increasing the loglevel (from info to debug for example) to aid with troubleshooting is often useful. Lastly, ensure that important messages are forwarded to a central log server, perhaps one that is proactively watching the events to notify you of pending failures.

Red Hat Enterprise Linux 7 uses rsyslog for event logging. rsyslog is an enhanced syslog daemon that supports both UDP and TCP transport, failover destinations, and queued operations.

The configuration file, /etc/rsyslog.conf, contains numerous comments. Additional documentation can be found in the /usr/share/doc/rsyslog-*/ directory.

Monitoring: Centralized Logging

Configuring a Server to Accept Remote Log Messages Using UDP

Uncomment the following lines in /etc/rsyslog.conf:
```
$ModLoad imudp
$UDPServerRun 514
```

Restart the service:

[root@server1 ~]# systemctl restart rsyslog

Open the host firewall for inbound port 514/UDP (and 514/TCP if you also enable TCP reception).
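
On Red Hat Enterprise Linux 7 the host firewall is normally managed by firewalld, so the firewall step might look like the following sketch. The function name open_syslog_port and the DRY_RUN variable are hypothetical conveniences for this example. The sketch assumes firewalld is the active firewall and uses the firewall-cmd options --permanent, --add-port, and --reload.

```shell
#!/bin/bash
# open_syslog_port (hypothetical helper): open the syslog port in firewalld
# for the given protocol (udp or tcp). The port defaults to 514.
# Set DRY_RUN=1 to print the commands instead of running them.
open_syslog_port() {
    local proto=$1 port=${2:-514} run=
    case $proto in
        udp|tcp) ;;
        *) echo "usage: open_syslog_port udp|tcp [port]" >&2; return 2 ;;
    esac
    if [ "${DRY_RUN:-0}" = 1 ]; then
        run=echo
    fi
    # Add the rule to the permanent configuration, then reload to apply it.
    $run firewall-cmd --permanent --add-port="${port}/${proto}" &&
        $run firewall-cmd --reload
}

# Example (run as root on the log server):
#   open_syslog_port udp
```

Using --permanent followed by --reload keeps the rule across reboots and applies it to the running firewall.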

Monitoring: Centralized Logging

Forwarding Messages via UDP to a Central Log Server

Decide on the types of messages (facility and priority) and the name or IP address of the central log server.
Add a line similar to the following to /etc/rsyslog.conf:
```
*.info      @server1
```

Restart the service:

[root@desktop1 ~]# systemctl restart rsyslog

Test the forwarding rule with the logger command:

[root@desktop1 ~]# logger "Hello from desktop1"
[root@desktop1 ~]# tail /var/log/messages
Jan 18 14:24:37 desktop1 root: Hello from desktop1
[root@server1 ~]# tail /var/log/messages
Jan 18 14:24:37 desktop1 root: Hello from desktop1

References

Viewing and Managing Log Files
rsyslog.conf(5) and logger(1) man pages

Monitoring: Hard Drive Failures

Hard drives die. It is not a question of if a drive will die but rather when.
If you know that a drive is dying, you can plan for its replacement instead of responding to an emergency.
SMART = Self-Monitoring, Analysis and Reporting Technology
- SMART is built into almost all modern hard drives.
- In Red Hat Enterprise Linux systems, the smartd SMART daemon polls all of the hard drives every 30 minutes.
  - If smartd sees that a drive is dying, it issues a message to /var/log/messages and sends an email message to the root user on the local system.
- You can specify an alternate, centralized email address in /etc/smartmontools/smartd.conf.

== m02p11_monitor_detect_hd_fail_1

In this section you will learn to use SMART to identify hard drive failures.

Hard drives die. It is not a question of if a drive will die but rather when. If you know that a drive is dying, you can plan for its replacement instead of having to respond to an emergency call at 4 a.m. This is where SMART comes in. SMART stands for Self-Monitoring, Analysis and Reporting Technology, and it is a feature built into almost all modern hard drives.

On your Red Hat Enterprise Linux system, there are multiple ways to work with SMART. The first is the SMART-daemon called smartd. smartd polls all of the hard drives every 30 minutes, and if it sees that a drive is dying, it issues a message to /var/log/messages. smartd also sends an email message to the root user on the local system, but an alternate, centralized email address can be specified in /etc/smartmontools/smartd.conf.

Monitoring: Hard Drive Failures

Another method of talking to a SMART-enabled drive is with the smartctl tool.

One method of using smartctl is to ask for only the overall health status:

[root@server1 ~]# smartctl -H /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-123.el7.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

For more detailed information, query all the individual counters: smartctl -a /dev/sda. The column you are interested in is RAW_VALUE.
To tell the drive to perform a test immediately, use smartctl -t testtype /dev/sda, where testtype is either offline, long, or short.
To view the results of a self-test (long or short), run smartctl -l selftest /dev/sda.
To get the output of the offline test or the errors from any other test, run smartctl -l error /dev/sda.
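
A health check like this is easy to script. The sketch below is an illustration rather than part of smartmontools: the function name smart_health_ok is hypothetical. It matches the two health lines that smartctl -H prints: the "SMART Health Status: OK" line shown above, and the "self-assessment test result: PASSED" line that ATA drives report.

```shell
#!/bin/bash
# smart_health_ok (hypothetical helper): read `smartctl -H` output on stdin
# and succeed only if the drive reports a healthy status.
smart_health_ok() {
    grep -Eq 'SMART Health Status: OK|self-assessment test result: PASSED'
}

# Example (run as root; /dev/sda is only an illustration):
#   smartctl -H /dev/sda | smart_health_ok || echo "check /dev/sda" >&2
```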

Reference

smartd(8), smartd.conf(5), and smartctl(8) man pages

Baselining: Using AIDE

Good baseline monitoring of systems is extremely helpful when troubleshooting.

Compare against the baseline when a system appears to be behaving erratically.
Report when a system is operating outside of specified parameters.
Tighten security.
Build trends for your systems and networks over time.
Use trends to spot events outside of the norm.
Deciding what to monitor depends on the work that a system does.
- For database servers or file servers, disk space, service availability, and load might be important.
- For a desktop system, you might just check to see that it is running.
Long-term monitoring can be used to:
- Measure growth of system load over time
- Predict when a new server or file store is required
- Measure how improvements impact service availability and help work flow and numerous other issues

== m02p13_baseline_use_aide_1

In this section you will learn to configure AIDE to track file system changes.

Good baseline monitoring of systems is extremely helpful when troubleshooting. A good baseline of system activity and use can be used to compare when a system appears to be behaving erratically, or more actively, to report when a system is operating outside of specified parameters.

It can also be used to tighten security. As you monitor, you can build up trends for your systems and networks over time. Using these, you can more easily spot events outside of the norm, which could be attempts to gain access to your systems, or a rogue system already under the control of external influences.

Deciding what to monitor depends on the work that a system does. For database servers or file servers, disk space, service availability, and load might be important. For a desktop system, you might just check to see that it is up and running. Data gained from longer-term monitoring can also be used outside of the purely technical. You can use it to measure the growth of system load over time and to predict when a new server or file store might be required. You can also use it to measure how improvements impact service availability, and therefore how they help workflow and numerous other issues.

Baselining: Using AIDE

AIDE = Advanced Intrusion Detection Environment
AIDE is a tool to check the integrity of files on the system.
When the system is in a known good state, AIDE is used to scan the system and collect information about installed files:
- Checksums
- Permissions
- Other characteristics
Information is placed in a database file which can be stored offline.
Use AIDE to compare the state of the system against the stored database and check for any changes.

Baselining: Using AIDE

Steps to Deploy AIDE

The following is an example of deploying AIDE on server1.

Install the aide package.

[root@server1 ~]# yum install -y aide
... Output omitted ...

Customize /etc/aide.conf to your liking.

Example

@@define DBDIR /var/lib/aide (1)
@@define LOGDIR /var/log/aide

database=file:@@{DBDIR}/aide.db.gz (2)
database_out=file:@@{DBDIR}/aide.db.new.gz (3)
gzip_dbout=yes
report_url=file:@@{LOGDIR}/aide.log (4)
report_url=stdout

# R is short for p+i+n+u+g+s+m+c+acl+selinux+xattrs+md5
NORMAL = R+rmd160+sha256 (5)
PERMS = p+i+u+g+acl+selinux

/ NORMAL (6)
!/etc/.*~
/root/\..* PERMS

1	Defines macros that can be used in `/etc/aide.conf`.
2	Configuration directive defining the location of the AIDE database. Note that this example uses a macro defined above.
3	Configuration directive defining the location in which `aide --init` will save a newly created database file.
4	Where the results of `aide --check` will be reported. Note that multiple locations are allowed.
5	Group definition line. For files selected in group `NORMAL`, AIDE stores the regular permissions, inode, number of links, user and group, size, mtime and ctime, POSIX ACLs, SELinux context, extended attributes, MD5 checksum, RMD160 checksum, and SHA256 checksum.
6	Selection lines. The first adds all files under `/` to be checked in group `NORMAL`; the second exempts all files in `/etc` that end in `~` from being checked; the third specifies that all files under `/root` that start with a period (`.`) should be checked in group `PERMS` only. Note that selection lines use regular expression syntax.

Run /usr/sbin/aide --init to build the initial database. This can take a while as it creates a gzipped database called /var/lib/aide/aide.db.new.gz.
```
[root@server1 ~]# aide --init

AIDE, version 0.15.1

### AIDE database at /var/lib/aide/aide.db.new.gz initialized.
```
Store /etc/aide.conf, /usr/sbin/aide and /var/lib/aide/aide.db.new.gz in a secure location (not on this same system!). Alternatively, extract a signature of these files so they can be verified in the future.

Copy /var/lib/aide/aide.db.new.gz to /var/lib/aide/aide.db.gz (the expected name).

[root@server1 ~]# cd /var/lib/aide
[root@server1 aide]# cp aide.db.new.gz aide.db.gz
[root@server1 aide]# cd
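
One way to "extract a signature" of these files is to record their checksums and keep the resulting manifest with the off-system copies. The following is a minimal sketch, assuming the sha256sum tool from GNU coreutils. The function names and the manifest file name are hypothetical.

```shell
#!/bin/bash
# record_checksums (hypothetical helper): write SHA-256 checksums of the
# given files to a manifest. Usage: record_checksums MANIFEST FILE...
record_checksums() {
    local manifest=$1
    shift
    sha256sum "$@" > "$manifest"
}

# verify_checksums (hypothetical helper): re-check every file listed in the
# manifest; returns non-zero if any file is missing or has changed.
verify_checksums() {
    sha256sum -c "$1"
}

# Example (the manifest name is an assumption; store it off this system):
#   record_checksums aide-files.sha256 /etc/aide.conf /usr/sbin/aide \
#       /var/lib/aide/aide.db.new.gz
#   verify_checksums aide-files.sha256
```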

== m02p15_baseline_steps_deploy_aide

The following is an example of deploying AIDE on server1.

Install the aide package.
Customize /etc/aide.conf to your liking.
- In the example /etc/aide.conf file shown here, number 1 defines macros that can be used in /etc/aide.conf.
- Number 2 is a configuration directive defining the location of the AIDE database. Note that this example uses a macro defined above.
- Number 3 is a configuration directive defining the location in which aide --init will save a newly created database file.
- Number 4 indicates where the results of aide --check will be reported. Note that multiple locations are allowed.
- Number 5 is a group definition line. For files selected in group NORMAL, AIDE stores the regular permissions, inode, number of links, user and group, size, mtime and ctime, POSIX ACLs, SELinux context, extended attributes, MD5 checksum, RMD160 checksum, and SHA256 checksum.
- Number 6 marks the selection lines. The first adds all files under / to be checked in group NORMAL; the second exempts all files in /etc that end in ~ from being checked; the third specifies that all files under /root that start with a period should be checked in group PERMS only. Note that selection lines use regular expression syntax.
Run /usr/sbin/aide --init to build the initial database. This can take a while as it creates a gzipped database called /var/lib/aide/aide.db.new.gz.
Store /etc/aide.conf, /usr/sbin/aide and /var/lib/aide/aide.db.new.gz in a secure location (not on this same system!). Alternatively, extract a signature of these files so they can be verified in the future.
Copy /var/lib/aide/aide.db.new.gz to /var/lib/aide/aide.db.gz (the expected name).

Baselining: Using AIDE

Verifying System Integrity with AIDE

This next example demonstrates testing file integrity using aide.

Modify a file on your system to be different.

[root@server1 ~]# echo shiny new >> /bin/tcsh

Run /usr/sbin/aide --check to check your system for inconsistencies.

[root@server1 ~]# aide --check
AIDE 0.15.1 found differences between database and filesystem!!
Start timestamp: 2014-12-15 08:22:04

Summary:
  Total number of files:        107530
  Added files:                  9
  Removed files:                0
  Changed files:                10


---------------------------------------------------
Added files:
---------------------------------------------------

... Output omitted ...

---------------------------------------------------
Changed files:
---------------------------------------------------

changed: /usr/bin/tcsh
... Output omitted ...

Results are displayed on standard output and in /var/log/aide/aide.log by default.

If you know about these changes, you can run aide --update to update your database and store it in a secure location again.
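
For scripted follow-up, the report can be reduced to just the changed paths. This sketch assumes the report format shown above, where each modified file appears on a line beginning with "changed: ". The function name aide_changed_files is hypothetical.

```shell
#!/bin/bash
# aide_changed_files (hypothetical helper): read an `aide --check` report on
# stdin and print only the paths listed on "changed: " lines.
aide_changed_files() {
    sed -n 's/^changed: //p'
}

# Example (run as root):
#   aide --check | aide_changed_files
```

The resulting list can then be reviewed before deciding whether to run aide --update.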

References

aide(1) and aide.conf(5) man pages
AIDE Quick Start: /usr/share/doc/aide-*/README.quickstart
AIDE Manual: /usr/share/doc/aide-*/manual.html

== m02p16_baseline_steps_verify_with_aide

This next example demonstrates testing file integrity using aide.

Modify a file on your system to be different.
[root@server1 ~]# echo shiny new >> /bin/tcsh
Run /usr/sbin/aide --check to check your system for inconsistencies.
Results will be displayed on standard output and in /var/log/aide/aide.log by default.

If you know about these changes (say after an update or you have edited a file yourself) you can run aide --update to update your database. Do not forget to store it in a secure location again.

Warning: If a system has had its root account or kernel compromised by an attacker, the installed version of AIDE or local copy of the database file may have been modified by the attacker or may respond with false results. In this case, it is a good idea to boot the system with a known good operating system environment with an offline copy of AIDE and a copy of the backed-up AIDE database. Also, you should never rely solely on an IDS (Intrusion Detection System) to check for changes. Some loadable-kernel-module rootkits can trick programs into reading other data than what is actually stored in a file. IDS is an add-on to your existing security framework, but should never be used as your sole security measure.

Baselining: Using `sar`

sar = System Activity Reporter
sar is provided by the sysstat package and does the following:
- Collects information about system activity from the operating system at a particular point in time.
- Takes a sample of data over a selected time period, either once or on some repeating schedule.
- Collected information includes memory usage, disk I/O, network activity, and so on.
There are two modes in which sar operates:
- When sysstat is installed, a cron job is set up that takes a one second sample of system activity every ten minutes and saves it to a file.
  - Use the sar command to read this information.
- Run sar from the command line to collect specific data, averaged over a certain period of time in seconds, a specified number of times.

== m02p17_baseline_use_sar_1

In this section you will learn to configure sar to monitor system performance.

Another useful system monitoring tool is sar, the System Activity Reporter, provided by the sysstat package. sar collects information about system activity from the operating system at a particular point in time. It normally takes a sample of data over a selected time period, either once or on some repeating schedule. The information it collects can include memory usage, disk I/O, network activity, and so on.

There are two modes in which sar operates. When sysstat is installed, a cron job is set up that takes a one second sample of system activity every ten minutes and saves it to a file. The sar command can be used to read this information. Otherwise, you can run sar from the command line to collect specific data, averaged over a certain period of time in seconds, a specified number of times.

Baselining: Using `sar`

Deploying the sar Command

Install the sysstat package. This package provides cron scripts (/etc/cron.d/sysstat) that will gather data automatically.

[root@server1 ~]# cat /etc/cron.d/sysstat

# Run system activity accounting tool every 10 minutes
*/10 * * * * root /usr/lib64/sa/sa1 1 1
# 0 * * * * root /usr/lib64/sa/sa1 600 6 &
# Generate a daily summary of process accounting at 23:53
53 23 * * * root /usr/lib64/sa/sa2 -A

The first column of sar output is the time of the recorded statistics.
- To ensure this column is always in a format you can parse, prefix your sar commands with LANG=C to get a unified time format.
- To make this the default for your session, use export LANG=C.
Example sar commands:
- sar -A displays all information collected today.
- sar -u 2 5 displays five samples of system CPU usage spaced 2 seconds apart.
- sar -r displays memory statistics.
- sar -S displays swap space utilization statistics.
- sar -b displays I/O statistics.

To generate useful output, add awk parsing:

[root@server1 ~]# export LANG=C
[root@server1 ~]# sar -r | tail -n+5 | awk '{print $1,$4,$8}'
10:20:01 %memused %swpused
10:30:01 92.28 0.05
10:40:01 92.28 0.05
Average: 92.28 0.05
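
The same parsing can feed a simple threshold check against a baseline. This is a minimal sketch: the function name mem_over is hypothetical, and it assumes the three-column layout produced by the awk command above (time first, then the memory percentage in the second column).

```shell
#!/bin/bash
# mem_over (hypothetical helper): read "TIME VALUE ..." lines on stdin and
# print the time and value of every sample whose second column exceeds
# LIMIT. The header row and the "Average:" row are skipped.
# Usage: mem_over LIMIT
mem_over() {
    awk -v limit="$1" \
        '$1 ~ /^[0-9][0-9]:/ && $2 + 0 > limit + 0 { print $1, $2 }'
}

# Example:
#   LANG=C sar -r | tail -n+5 | awk '{print $1,$4,$8}' | mem_over 90
```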

Reference

sar(1), sa1(8), sa2(8), and sadc(8) man pages

== m02p18_baseline_use_sar_2

Install the sysstat package. This package provides cron scripts (/etc/cron.d/sysstat) that will gather data automatically.

The first column of sar output will always be the time of the recorded statistics. To make sure that this column is always in a format you can understand/parse, it is best to prefix your sar commands with LANG=C to get a unified time format. You can also execute export LANG=C to make this the default for your session.

Here are a few examples of how you can use sar commands:

Run sar -A to display all information collected today.
Run sar -u 2 5 to display five samples of system CPU usage spaced 2 seconds apart.
Run sar -r to display memory statistics.
Run sar -S to display swap space utilization statistics.
Run sar -b to display I/O statistics.

To generate useful output, add awk parsing, for example: sar -r | tail -n+5 | awk '{print $1,$4,$8}'

Network Monitoring

Network monitoring measures network activity, and looks for slow or failing servers, routers, switches, or other devices.
There are active and passive monitoring techniques that may involve agents residing on the network equipment that notify or are polled by a network management system.
Many enterprises use network monitoring/management systems and services from CA, HP, IBM, and other vendors.
Nagios is a free open source monitoring tool.
- Provided via the EPEL (Extra Packages for Enterprise Linux) repository from the Fedora project
- Not supported by Red Hat
- Modular system consisting of a core nagios package with additional functionality provided by plug-ins
- Plug-ins can run on local machines to provide information not readily available via the network
- Flexible configuration allows definitions of time periods, admin groups, system groups, and custom command sets
- Web-based interface on main nagios server allows configuring tests and settings for Nagios and hosts it is monitoring

== m02p19_network_monitor

In this section you will learn to identify network monitoring alternatives.

Network monitoring involves measuring network activity, and looking for slow or failing servers, routers, switches, or other devices. There are active and passive monitoring techniques that may involve agents residing on the network equipment that notify or are polled by a network management system.

Most medium- and large-scale enterprises have committed to working with network monitoring/management systems and services from vendors like CA, HP, IBM, and others. There are, of course, many open source, GPL-friendly options in this area too, like OpenNMS and Nagios.

Nagios is a great free, open source monitoring tool. It is currently provided via the EPEL (Extra Packages for Enterprise Linux) repository from the Fedora project. Therefore, Nagios is not supported by Red Hat.

Nagios is a modular system, consisting of a core nagios package with additional functionality provided by plug-ins. Plug-ins can be run on local machines to provide information that is not readily available via the network, such as disk space usage, etc. The configuration is very flexible, allowing definitions of time periods, admin groups, system groups, and even custom command sets.
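
As an illustration of what a local plug-in does, the sketch below classifies a disk usage percentage using the standard Nagios plug-in exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function name check_usage and the 80/90 default thresholds are assumptions made for this example. It is not a plug-in shipped with Nagios.

```shell
#!/bin/bash
# check_usage (hypothetical plug-in logic): print a one-line status for a
# usage percentage and return the matching Nagios plug-in exit code.
# Usage: check_usage USED_PERCENT [WARN] [CRIT]
check_usage() {
    local used=$1 warn=${2:-80} crit=${3:-90}
    case $used in
        # Reject empty or non-numeric input.
        ''|*[!0-9]*) echo "DISK UNKNOWN - invalid value '$used'"; return 3 ;;
    esac
    if [ "$used" -ge "$crit" ]; then
        echo "DISK CRITICAL - ${used}% used"
        return 2
    elif [ "$used" -ge "$warn" ]; then
        echo "DISK WARNING - ${used}% used"
        return 1
    fi
    echo "DISK OK - ${used}% used"
    return 0
}

# Example: check the root file system (column 5 of `df -P` is the capacity).
#   check_usage "$(df -P / | awk 'NR == 2 { sub(/%/, "", $5); print $5 }')"
```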

There is a web-based interface on the main nagios server that can be used to configure tests and settings for Nagios itself or for the hosts it is monitoring.
